#use the fixed cohorts
new_cohort_assignment <- TRUE
#use the new fixed genomes
use_fixed_genotypes <- TRUE
#adjust these if you want the relevant chunks to run
rerun_COLONY <- FALSE
rerun_COLONY_all <- FALSE
rerun_LDNE <- TRUE
include_2014 <- TRUE #if FALSE, 2014 estimates are still run but excluded from the analysis tables
#p_crit choices
p_crit_list <- c(0, 0.02, 0.05)
p_crit_chosen <- 0.05
#COLONY params for simulations that will actually be used
COLONY_run_length <- 2
COLONY_allele_freq <- NULL # vs "overall" for population averages
COLONY_dropout <- 0
COLONY_error_rate <- 0.02
COLONY_random_seed <- 1
#using old data for cohorts and genotypes, no 2014, pcrit 2014, defaults for COLONY should regenerate the original published results exactly
This document/repository contains and outlines a re-analysis of the data presented in Davenport et al. (2021), performed by Andrew Jones.
The authors were made aware of an error in the output files used by both NeEstimator and COLONY to estimate the number of breeders per cohort: a bug caused the data to become shuffled. Below, the published data are compared against files rebuilt directly from the original raw data. The input files for Nb estimation in NeEstimator and COLONY are re-made from the raw data (restricted to the published loci) and the updated results are presented. The published work also combined estimators of effective size to simplify reporting and decision making (combined estimates of Nb per cohort); those combined estimates are recomputed here from the revised results.
An additional sensitivity analysis is performed on the full dataset to illustrate how the inclusion of different SNPs and individuals can affect final point estimates of effective size. It highlights the importance of considering confidence intervals (i.e. using a 95% CI we expect that, 95% of the time, the true value of the parameter lies within its interval), as well as multiple estimators, in conservation decisions.
In Davenport et al. (2021), the effective number of breeders (Nb) was estimated for four consecutive cohorts (2010-2013), based on the premise that when a genetic sample contains only individuals from a single age cohort (a group of individuals in the same age class), the estimate of effective population size (Ne) from a single-sample estimator in species with overlapping generations corresponds to the effective number of breeders (Nb). For long-lived, iteroparous species such as the white shark, estimates of Nb are generally considered more useful for monitoring: they apply to a single breeding season (rather than requiring a sample representative of a whole generation, an assumption of Ne; see Waples (2022)), and they provide an accessible parameter for monitoring population trends at the ecological timescales most relevant to conservation and management. Nb is particularly useful where juvenile or YOY samples can be collected easily (e.g. through the SMART shark program NSW). Estimating Nb and monitoring its change over time allows timely detection of population trends (decline, restoration, recovery, expansion), even using as few as four or five consecutive reproductive cycles (Leberg (2005); Wang (2005); Luikart et al. (2021)).
The original DArT data for this project can be found in /Data/Raw.
The results of the reanalysis can be found in /Results.
The filtered list of loci from the published analysis can be found in Data/Raw/Report-DSha18-3402/File_Formats/Genepop_NeEstimator_Input/nswdpi_whiteshark_gp.gen.
The loci names were extracted from this document to ensure the data from the publication are used in the reanalysis here. Readers are referred to the original publication, Davenport et al. (2021), for more detailed methods. In summary, the list of loci/SNPs that meet the article’s filtering criteria is saved in the ‘.gen’ file. These loci are used here to re-make the input files for NeEstimator and COLONY from the original raw data, from which Nb per cohort is re-estimated. The full raw data are also used to re-estimate Nb in a sensitivity analysis of loci/individuals.
The processed data are found in /Data/Processed.
For this section I follow the approach used by Dani as far as was required to confirm the work of Dean and Paul.
A bug in an R-package used in the original publication is suspected to have introduced inconsistencies into the output files used in the final analysis.
# read in published strata
strata_info = read.table("Original Project Files/Strata_Sex_Model_COHORT_REGRESSION.tsv", header = TRUE) # used in the published ms
# read in published genepop file affected by the bug - herein referred to as 'orig-genepop'
orig_gp <- adegenet::read.genepop(file = "Original Project Files/File_Formats/Genepop_NeEstimator_Input/nswdpi_whiteshark_gp.gen", ncode = 3L, quiet = TRUE)
orig_gl <- dartR::gi2gl(orig_gp)
## Starting ::
## Starting dartR
## Starting gi2gl
## Starting gl.compliance.check
## Processing genlight object with SNP data
## The slot loc.all, which stores allele name for each locus, is empty.
## Creating a dummy variable (A/C) to insert in this slot.
## Checking coding of SNPs
## SNP data scored NA, 0, 1 or 2 confirmed
## Checking locus metrics and flags
## Recalculating locus metrics
## Checking for monomorphic loci
## Dataset contains monomorphic loci
## Checking for loci with all missing data
## No loci with all missing data detected
## Checking whether individual names are unique.
## Checking for individual metrics
## Warning: Creating a slot for individual metrics
## Checking for population assignments
## Population assignments confirmed
## Spelling of coordinates checked and changed if necessary to
## lat/lon
## Completed: gl.compliance.check
## Completed: ::
## Completed: dartR
## Completed: gi2gl
# gi2gl didn't fix the loci names, so do it here
orig_gl@loc.names<- sub(".*?-", "", str_replace_all(orig_gl@loc.names,"__","-")) # change whitelist loci separators so they are in the format needed to match the genlight obj
# 'new_genepop'
# read in the original data, and use the list of loci to make a new dataset - herein referred to as 'new-genepop'
input_gl = dartR::gl.read.dart("Data/Raw/Report_DSha18-3402_SNP_2_ReLabeled.csv")
## Starting ::
## Starting dartR
## Starting gl.read.dart
## Starting utils.read.dart
## Topskip not provided.
## Setting topskip to 6 .
## Reading in the SNP data
## Detected 2 row format.
## Number of rows per clone (should be only 2 s): 2
## Added the following locus metrics:
## AlleleID AlleleSequence TrimmedSequence Chrom_Shark_Callorhinchus_v1 ChromPos_Shark_Callorhinchus_v1 AlnCnt_Shark_Callorhinchus_v1 AlnEvalue_Shark_Callorhinchus_v1 SNP SnpPosition CallRate OneRatioRef OneRatioSnp FreqHomRef FreqHomSnp FreqHets PICRef PICSnp AvgPIC AvgCountRef AvgCountSnp RepAvg .
## Recognised: 278 individuals and 9841 SNPs in a 2 row format using Data/Raw/Report_DSha18-3402_SNP_2_ReLabeled.csv
## Completed: utils.read.dart
## Starting utils.dart2genlight
## Starting conversion....
## Format is 2 rows.
## Please note conversion of bigger data sets will take some time!
## Once finished, we recommend to save the object using save(object, file="object.rdata")
## Completed: utils.dart2genlight
## Data read in. Please check carefully the output above
## Read depth calculated and added to the locus metrics
## Minor Allele Frequency (MAF) calculated and added to the locus metrics
## Recalculating locus metrics provided by DArT (optionally specified)
## Completed: ::
## Completed: dartR
## Completed: gl.read.dart
# change the GL loci names so they are in a suitable format
input_gl@loc.names <- sub("^([^-]*-[^-]*).*", "\\1", input_gl@loc.names)
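As a quick illustration (with made-up locus labels, not the real DArT names), the two regex idioms used on the loci names behave as follows: the first strips everything up to and including the first "-" after converting "__" separators, and the second keeps only the first two "-"-separated fields.

```r
# Hypothetical labels only - illustrating the two name-cleaning regexes
x_genepop <- "prefix__part1-part2"            # "__" used as a separator
x_clean <- sub(".*?-", "", gsub("__", "-", x_genepop))
stopifnot(x_clean == "part1-part2")           # everything up to the first "-" removed

x_dart <- "part1-part2-extraFields"           # raw DArT-style name (hypothetical)
stopifnot(sub("^([^-]*-[^-]*).*", "\\1", x_dart) == "part1-part2")  # keep first two fields
```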
# filter the gl to have the same loci as used in 'orig-gp'
order_loci <- sapply(orig_gl@loc.names, function(x,order){which(order == x)}, order = input_gl@loc.names)
new_gl <- input_gl[,order_loci]
# dartR has a bug where it doesn't subset the metadata stored in the other slots (e.g. @other$loc.metrics), so do it manually
dim(new_gl@other$loc.metrics)
## [1] 3668 24
new_gl@other$loc.metrics <- new_gl@other$loc.metrics[order_loci,]
# match so the files have the same samples, in the same order
order_ind <- sapply(orig_gl@ind.names, function(x,order){which(order == x)}, order = str_replace_all(new_gl@ind.names, "_", "-"))
new_gl <- new_gl[order_ind,]
# check
dim(new_gl@other$loc.metrics)
## [1] 3668 24
# add the population information, also make sure its in the right order
pop = dplyr::left_join(tibble::tibble(TARGET_ID=str_replace_all(new_gl@ind.names, "_", "-")), strata_info, by = "TARGET_ID")
new_gl@pop<- factor(pop$STRATA)
# check
which(head(str_replace_all(new_gl@ind.names, "_", "-")) == head(orig_gl@ind.names))
## [1] 1 2 3 4 5 6
which(head(orig_gl@loc.names) == head(new_gl@loc.names))
## [1] 1 2 3 4 5 6
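The sapply/which matching used above returns, for each name in the reference object, its position in the other object; indexing with that vector reorders the second object to match the first. A minimal sketch with toy names (base R's match() does the same thing):

```r
# Toy example of the matching idiom used for order_loci / order_ind above
ref_names    <- c("a", "b", "c")          # the order we want (as in orig_gl)
target_names <- c("b", "c", "a")          # the order we have (as in input_gl)
idx <- sapply(ref_names, function(x, order) which(order == x), order = target_names)
stopifnot(all(target_names[idx] == ref_names))
# equivalent base R idiom
stopifnot(all(idx == match(ref_names, target_names)))
```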
# compare the old and the new files for missingness
# print out each genlight to get info on missingness, loci, individuals
orig_gl
## /// GENLIGHT OBJECT /////////
##
## // 241 genotypes, 3,668 binary SNPs, size: 1.2 Mb
## 4506 (0.51 %) missing data
##
## // Basic content
## @gen: list of 241 SNPbin
## @ploidy: ploidy of each individual (range: 2-2)
##
## // Optional content
## @ind.names: 241 individual labels
## @loc.names: 3668 locus labels
## @loc.all: 3668 alleles
## @pop: population of each individual (group size range: 1-63)
## @other: a list containing: loc.metrics loc.metrics.flags verbose history ind.metrics
# compare the old and the new files for missingness
# print out each genlight to get info on missingness, loci, individuals
new_gl
## ********************
## *** DARTR OBJECT ***
## ********************
##
## ** 241 genotypes, 3,668 SNPs , size: 5.7 Mb
##
## missing data: 4503 (=0.51 %) scored as NA
##
## ** Genetic data
## @gen: list of 241 SNPbin
## @ploidy: ploidy of each individual (range: 2-2)
##
## ** Additional data
## @ind.names: 241 individual labels
## @loc.names: 3668 locus labels
## @loc.all: 3668 allele labels
## @position: integer storing positions of the SNPs [within 69 base sequence]
## @pop: population of each individual (group size range: 1-63)
## @other: a list containing: loc.metrics, ind.metrics, loc.metrics.flags, verbose, history
## @other$ind.metrics: service, plate_location
## @other$loc.metrics: AlleleID, AlleleSequence, TrimmedSequence, Chrom_Shark_Callorhinchus_v1, ChromPos_Shark_Callorhinchus_v1, AlnCnt_Shark_Callorhinchus_v1, AlnEvalue_Shark_Callorhinchus_v1, SNP, SnpPosition, CallRate, OneRatioRef, OneRatioSnp, FreqHomRef, FreqHomSnp, FreqHets, PICRef, PICSnp, AvgPIC, AvgCountRef, AvgCountSnp, RepAvg, clone, uid, rdepth
## @other$latlon[g]: no coordinates attached
These results show that the number of individuals and the number of loci do not change between the published and the new dataset. However, the published dataset contains 4506 missing genotype calls, while the re-analysis dataset contains 4503.
# visualize the data
ogplot = adegenet::glPlot(orig_gl, posi="topleft")
newplot = adegenet::glPlot(new_gl, posi="topleft")
More details available in Dani’s report and Dean and Paul’s report.
Main points:
The new cohort assignment file “Data/Processed/New_Cohort_Assignment.tsv” was created by running the script Scripts/Age_at_Length.R.
The von Bertalanffy parameters used are from O’Connor (2011).
meta_data <- readr::read_csv("Data/Raw/Copy of UPDTAED LAT LONGS UQ genetics PhD181018 Paul Butcher NSW DPI.csv") %>%
#select(MBB_Code, FL, TL) %>%
mutate(date_column = as.Date(`Date tagged`, format="%d/%m/%y"),
modified_year = year(date_column))
# 1. relationship between TL and FL
lm1 <- lm(meta_data$TL ~ meta_data$FL)
Tl_intercept <- lm1$coefficients[1]
Tl_slope <- lm1$coefficients[2]
# round the coefficients down to two decimal places, as in eq. (1) of the manuscript
Tl_intercept <- floor(Tl_intercept * 100) / 100
Tl_slope <- floor(Tl_slope * 100) / 100
TL <- sapply(meta_data$FL, function(x) {Tl_intercept + x * Tl_slope}) # eq (1) manuscript
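Note that floor(x * 100) / 100 truncates towards negative infinity rather than rounding to the nearest two-decimal value; with hypothetical coefficient values:

```r
# floor(x * 100) / 100 rounds *down* to two decimals (unlike round())
stopifnot(floor(3.789 * 100) / 100 == 3.78)
stopifnot(round(3.789, 2) == 3.79)             # round() would give 3.79
stopifnot(floor(-0.123 * 100) / 100 == -0.13)  # negatives move away from zero
```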
# Estimate age directly using values from O'Connor 2011 and FL
#solve for age
vbgf3 <- function(L, L_infinity, K_value, t_0) {
result <- t_0 - (1 / K_value) * log(1 - (L / L_infinity))
return(result)
}
# Estimating age at length
# info:
# Best fitting parameters for Males
Linf <- 798.94# cm TL
k <- 0.047 #
L0 <- 140 #cm
T0 <- -3.8 #years
Males <- list(L_infinity_value = Linf, K_value = k, L0 = L0, t_0_value = T0)
# Best fitting parameters for Females
Linf<- 719.02# cm TL
k<- 0.056 #
L0<- 140 #cm
T0 <- -3.8 #years
Females <- list(L_infinity_value = Linf, K_value = k, L0 = L0, t_0_value = T0)
# Best fitting parameters for combined sexes
Linf<- 746.66# cm TL
k<- 0.053#
L0<- 140 #cm
T0 <- -3.8 #years
Combined <- list(L_infinity_value = Linf, K_value = k, L0 = L0, t_0_value = T0)
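As an illustrative sanity check, vbgf3 should invert the standard von Bertalanffy growth curve \(L(t) = L_\infty (1 - e^{-K(t - t_0)})\); using the male parameters above and a hypothetical age (the forward curve below is the textbook form, written here only for checking):

```r
# Forward VBGF growth curve (standard form; for checking vbgf3 only)
vbgf_length <- function(t, L_infinity, K_value, t_0) {
  L_infinity * (1 - exp(-K_value * (t - t_0)))
}
# vbgf3 as defined above, repeated here for self-containment
vbgf3 <- function(L, L_infinity, K_value, t_0) {
  t_0 - (1 / K_value) * log(1 - (L / L_infinity))
}
t_true <- 2.5                                   # hypothetical age (years)
L <- vbgf_length(t_true, 798.94, 0.047, -3.8)   # male parameters from above
stopifnot(abs(vbgf3(L, 798.94, 0.047, -3.8) - t_true) < 1e-8)
```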
# indices for females and males
idx_f<- which(meta_data$Sex == "F")
idx_m<- which(meta_data$Sex == "M")
male_t = sapply(TL[idx_m], function(x) {vbgf3(x, Males[["L_infinity_value"]], Males[["K_value"]], Males[["t_0_value"]])})
female_t = sapply(TL[idx_f], function(x) {vbgf3(x, Females[["L_infinity_value"]], Females[["K_value"]], Females[["t_0_value"]])})
# Cohort (birth year) equals year sampled minus estimated age
male_cohort = trunc(meta_data$modified_year[idx_m] - male_t)
female_cohort = trunc(meta_data$modified_year[idx_f] - female_t) # for some samples, e.g. MBB1348, the cohort depends on how you round the age or the year
output = tibble::tibble(STRATA = c(male_cohort, female_cohort), MBB_CODE = c(meta_data$MBB_Code[idx_m], meta_data$MBB_Code[idx_f]))
readr::write_tsv(output, "Data/Processed/New_Cohort_Assignment.tsv")
Below, the input files for NeEstimator and COLONY are remade to make new estimates of Nb, using a new file built from the published list of loci (“new_gp”).
dartR is used to write out a genepop file, and a custom script to write out a COLONY file.
# 1. write out new_gp as a genepop file; check that the file has been written out correctly
# 1a. run this file through NeEstimator
# first it is important to make sure that the cohorts are correctly assigned
# some sharks have no cohort - why? missing metadata?
if(new_cohort_assignment){
cohort_assignment_file <- "Data/Processed/New_Cohort_Assignment.tsv"
cohort_assignment = readr::read_tsv(cohort_assignment_file) %>% #New cohort assignment
separate(MBB_CODE, into = c("string1", "string2"), sep = " ", remove = FALSE) %>%
unite(MBB_CODE_JOIN, string1, string2, sep = "_")
}else{
cohort_assignment_file <- "Original Project Files/Strata_Sex_Model_COHORT_REGRESSION.tsv"
  cohort_assignment = readr::read_tsv(cohort_assignment_file) %>% #Original cohort assignment
separate(INDIVIDUALS, into = c("string1", "string2"), sep = "-", remove = FALSE) %>%
unite(MBB_CODE_JOIN, string1, string2, sep = "_")
}
if(use_fixed_genotypes){
gl_file <- new_gl
}else{
gl_file <- orig_gl
}
gl_file$ind.names <- str_replace(gl_file$ind.names, "-", "_")
ind_order = tibble::tibble(MBB_CODE_JOIN = gl_file$ind.names, OLD_STRATA = gl_file@pop) %>%
left_join(., cohort_assignment, by = "MBB_CODE_JOIN")
gl_file@pop <- as.factor(ind_order$STRATA)
table(gl_file@pop)
##
## 2005 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
## 1 3 6 11 29 41 52 64 23 9 2
dartR::gl2genepop(gl_file, outfile = "genotypes_for_Ne.gen", outpath = "Data/Processed/NeEstimator")
## Starting ::
## Starting dartR
## Starting gl2genepop
## Processing genlight object with SNP data
## The genepop file is saved as: Data/Processed/NeEstimator/genotypes_for_Ne.gen/
## Completed: ::
## Completed: dartR
## Completed: gl2genepop
#need to read COLONY data in from files
colony_results_files <- list.files("./Results/COLONY/")
col_list <- list()
for (i in seq_along(colony_results_files)){
c_r_f <- colony_results_files[i]
res <- readLines(file.path("./Results/COLONY/", c_r_f))
temp <- str_match(c_r_f, "^COLONY_output_(scenario_[a-zA-Z0-9]+)_([0-9]{4}).txt$")
Year <- as.numeric(temp[3])
Scenario <- temp[2]
Nb_SA_est <- as.numeric(na.omit(str_match(res, "^Ne\\s+=\\s+([0-9]+)$")[, 2])[1])
Nb_SA_upper <- as.numeric(na.omit(str_match(res, "^CI95\\(U\\)\\s+=\\s+([0-9]+)$")[, 2])[1])
Nb_SA_lower <- as.numeric(na.omit(str_match(res, "^CI95\\(L\\)\\s+=\\s+([0-9]+)$")[, 2])[1])
col_list[[i]] <- data.frame(Year, Scenario, Nb_SA_est, Nb_SA_lower, Nb_SA_upper)
}
All_COLONY_data <- bind_rows(col_list) %>% filter(Year >= 2010, Year <= 2014)
write_csv(All_COLONY_data, "./Results/COLONY/ResultsSummary.csv")
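The regexes in the parsing loop above can be illustrated on hypothetical file names and output lines (the real COLONY output format is assumed to match these patterns):

```r
library(stringr)

# Hypothetical inputs matching the patterns used in the parsing loop above
fname <- "COLONY_output_scenario_final_2012.txt"
m <- str_match(fname, "^COLONY_output_(scenario_[a-zA-Z0-9]+)_([0-9]{4}).txt$")
stopifnot(m[2] == "scenario_final", m[3] == "2012")

# str_match over a vector of lines gives NA for non-matches; na.omit keeps the hit
lines <- c("some header", "Ne = 196", "CI95(L) = 133", "CI95(U) = 319")
est   <- as.numeric(na.omit(str_match(lines, "^Ne\\s+=\\s+([0-9]+)$")[, 2])[1])
lower <- as.numeric(na.omit(str_match(lines, "^CI95\\(L\\)\\s+=\\s+([0-9]+)$")[, 2])[1])
stopifnot(est == 196, lower == 133)
```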
Scenario List:
ggplot(All_COLONY_data, aes(x = Year, y = Nb_SA_est, colour = Scenario, group=Scenario)) + geom_point(position =dodge) + geom_errorbar(aes(ymin = Nb_SA_lower, ymax = Nb_SA_upper), width = width, position =dodge)
All_COLONY_data %>% filter(Scenario %in% c("scenario_00", "scenario_01", "scenario_02", "scenario_03","scenario_08", "scenario_final")) %>% ggplot( aes(x = Year, y = Nb_SA_est, colour = Scenario, group=Scenario)) + geom_point(position =dodge) + geom_errorbar(aes(ymin = Nb_SA_lower, ymax = Nb_SA_upper), width = width, position =dodge)
Nb_SA_data <- All_COLONY_data %>% filter(Scenario == "scenario_final") # need to pick one; scenario_00 is the defaults with 'quick run', scenario_08 is the original settings
write_csv(Nb_SA_data, "./Results/COLONY/ResultsSummary_final.csv")
Estimates are slightly lower for some years if population allele frequencies are used; otherwise they are essentially the same. There is no compelling reason to deviate from the original settings.
## Starting gl2genepop
## Processing genlight object with SNP data
## The genepop file is saved as: /tmp/RtmpSfUVyo/dummy.gen/
## Completed: gl2genepop
## $`2010_MBB`
## Statistic Frequency 1
## Lowest Allele Frequency Used 0+
## Harmonic Mean Sample Size 28.8
## Independent Comparisons 4639447
## OverAll r^2 0.041995
## Expected r^2 Sample 0.03859
## Estimated Ne^ 88.7
## CI low Parametric 87.3
## CI high Parametric 90.2
## CI low JackKnife 48.5
## CI high JackKnife 332.6
##
## Starting gl2genepop
## Processing genlight object with SNP data
## The genepop file is saved as: /tmp/RtmpSfUVyo/dummy.gen/
## Completed: gl2genepop
## $`2011_MBB`
## Statistic Frequency 1
## Lowest Allele Frequency Used 0+
## Harmonic Mean Sample Size 40.4
## Independent Comparisons 5352795
## OverAll r^2 0.027775
## Expected r^2 Sample 0.026659
## Estimated Ne^ 296.6
## CI low Parametric 288
## CI high Parametric 305.8
## CI low JackKnife 160.1
## CI high JackKnife 1498.1
##
## Starting gl2genepop
## Processing genlight object with SNP data
## The genepop file is saved as: /tmp/RtmpSfUVyo/dummy.gen/
## Completed: gl2genepop
## $`2012_MBB`
## Statistic Frequency 1
## Lowest Allele Frequency Used 0+
## Harmonic Mean Sample Size 51.5
## Independent Comparisons 5663913
## OverAll r^2 0.021786
## Expected r^2 Sample 0.020592
## Estimated Ne^ 277.2
## CI low Parametric 271.3
## CI high Parametric 283.2
## CI low JackKnife 166.7
## CI high JackKnife 735
##
## Starting gl2genepop
## Processing genlight object with SNP data
## The genepop file is saved as: /tmp/RtmpSfUVyo/dummy.gen/
## Completed: gl2genepop
## $`2013_MBB`
## Statistic Frequency 1
## Lowest Allele Frequency Used 0+
## Harmonic Mean Sample Size 63.2
## Independent Comparisons 6045258
## OverAll r^2 0.018056
## Expected r^2 Sample 0.016599
## Estimated Ne^ 226.6
## CI low Parametric 223.5
## CI high Parametric 229.9
## CI low JackKnife 153.3
## CI high JackKnife 411
##
## Starting gl2genepop
## Processing genlight object with SNP data
## The genepop file is saved as: /tmp/RtmpSfUVyo/dummy.gen/
## Completed: gl2genepop
## $`2014_MBB`
## Statistic Frequency 1
## Lowest Allele Frequency Used 0+
## Harmonic Mean Sample Size 22.9
## Independent Comparisons 4454940
## OverAll r^2 0.051739
## Expected r^2 Sample 0.04988
## Estimated Ne^ 163.9
## CI low Parametric 158.1
## CI high Parametric 170.2
## CI low JackKnife 75.9
## CI high JackKnife Inf
##
## Starting gl2genepop
## Processing genlight object with SNP data
## The genepop file is saved as: /tmp/RtmpSfUVyo/dummy.gen/
## Completed: gl2genepop
## $`2010_MBB`
## Statistic Frequency 1 Frequency 2
## Lowest Allele Frequency Used 0.020 0+
## Harmonic Mean Sample Size 28.8 28.8
## Independent Comparisons 3433638 4639447
## OverAll r^2 0.042932 0.041995
## Expected r^2 Sample 0.038558 0.03859
## Estimated Ne^ 68.7 88.7
## CI low Parametric 67.7 87.3
## CI high Parametric 69.7 90.2
## CI low JackKnife 37.1 48.5
## CI high JackKnife 252.1 332.6
##
## Starting gl2genepop
## Processing genlight object with SNP data
## The genepop file is saved as: /tmp/RtmpSfUVyo/dummy.gen/
## Completed: gl2genepop
## $`2011_MBB`
## Statistic Frequency 1 Frequency 2
## Lowest Allele Frequency Used 0.020 0+
## Harmonic Mean Sample Size 40.5 40.4
## Independent Comparisons 4137761 5352795
## OverAll r^2 0.027992 0.027775
## Expected r^2 Sample 0.026634 0.026659
## Estimated Ne^ 243.4 296.6
## CI low Parametric 236.7 288
## CI high Parametric 250.5 305.8
## CI low JackKnife 130 160.1
## CI high JackKnife 1272.5 1498.1
##
## Starting gl2genepop
## Processing genlight object with SNP data
## The genepop file is saved as: /tmp/RtmpSfUVyo/dummy.gen/
## Completed: gl2genepop
## $`2012_MBB`
## Statistic Frequency 1 Frequency 2
## Lowest Allele Frequency Used 0.020 0+
## Harmonic Mean Sample Size 51.5 51.5
## Independent Comparisons 3691581 5663913
## OverAll r^2 0.022145 0.021786
## Expected r^2 Sample 0.020591 0.020592
## Estimated Ne^ 212.3 277.2
## CI low Parametric 208 271.3
## CI high Parametric 216.8 283.2
## CI low JackKnife 128.1 166.7
## CI high JackKnife 540.5 735
##
## Starting gl2genepop
## Processing genlight object with SNP data
## The genepop file is saved as: /tmp/RtmpSfUVyo/dummy.gen/
## Completed: gl2genepop
## $`2013_MBB`
## Statistic Frequency 1 Frequency 2
## Lowest Allele Frequency Used 0.020 0+
## Harmonic Mean Sample Size 63.3 63.2
## Independent Comparisons 4167409 6045258
## OverAll r^2 0.018346 0.018056
## Expected r^2 Sample 0.016576 0.016599
## Estimated Ne^ 186.3 226.6
## CI low Parametric 183.7 223.5
## CI high Parametric 189 229.9
## CI low JackKnife 126.3 153.3
## CI high JackKnife 333.1 411
##
## Starting gl2genepop
## Processing genlight object with SNP data
## The genepop file is saved as: /tmp/RtmpSfUVyo/dummy.gen/
## Completed: gl2genepop
## $`2014_MBB`
## Statistic Frequency 1 Frequency 2
## Lowest Allele Frequency Used 0.020 0+
## Harmonic Mean Sample Size 22.9 22.9
## Independent Comparisons 4454940 4454940
## OverAll r^2 0.051739 0.051739
## Expected r^2 Sample 0.04988 0.04988
## Estimated Ne^ 163.9 163.9
## CI low Parametric 158.1 158.1
## CI high Parametric 170.2 170.2
## CI low JackKnife 75.9 75.9
## CI high JackKnife Inf Inf
##
## Starting gl2genepop
## Processing genlight object with SNP data
## The genepop file is saved as: /tmp/RtmpSfUVyo/dummy.gen/
## Completed: gl2genepop
## $`2010_MBB`
## Statistic Frequency 1 Frequency 2
## Lowest Allele Frequency Used 0.050 0+
## Harmonic Mean Sample Size 28.9 28.8
## Independent Comparisons 2469334 4639447
## OverAll r^2 0.043427 0.041995
## Expected r^2 Sample 0.038533 0.03859
## Estimated Ne^ 61.2 88.7
## CI low Parametric 60.2 87.3
## CI high Parametric 62.2 90.2
## CI low JackKnife 31.9 48.5
## CI high JackKnife 257.4 332.6
##
## Starting gl2genepop
## Processing genlight object with SNP data
## The genepop file is saved as: /tmp/RtmpSfUVyo/dummy.gen/
## Completed: gl2genepop
## $`2011_MBB`
## Statistic Frequency 1 Frequency 2
## Lowest Allele Frequency Used 0.050 0+
## Harmonic Mean Sample Size 40.5 40.4
## Independent Comparisons 2167338 5352795
## OverAll r^2 0.02822 0.027775
## Expected r^2 Sample 0.026616 0.026659
## Estimated Ne^ 205.7 296.6
## CI low Parametric 199 288
## CI high Parametric 212.8 305.8
## CI low JackKnife 109.8 160.1
## CI high JackKnife 1023.9 1498.1
##
## Starting gl2genepop
## Processing genlight object with SNP data
## The genepop file is saved as: /tmp/RtmpSfUVyo/dummy.gen/
## Completed: gl2genepop
## $`2012_MBB`
## Statistic Frequency 1 Frequency 2
## Lowest Allele Frequency Used 0.050 0+
## Harmonic Mean Sample Size 51.6 51.5
## Independent Comparisons 2092733 5663913
## OverAll r^2 0.022087 0.021786
## Expected r^2 Sample 0.02055 0.020592
## Estimated Ne^ 214.8 277.2
## CI low Parametric 208.9 271.3
## CI high Parametric 220.9 283.2
## CI low JackKnife 134 166.7
## CI high JackKnife 489.2 735
##
## Starting gl2genepop
## Processing genlight object with SNP data
## The genepop file is saved as: /tmp/RtmpSfUVyo/dummy.gen/
## Completed: gl2genepop
## $`2013_MBB`
## Statistic Frequency 1 Frequency 2
## Lowest Allele Frequency Used 0.050 0+
## Harmonic Mean Sample Size 63.4 63.2
## Independent Comparisons 2091127 6045258
## OverAll r^2 0.018472 0.018056
## Expected r^2 Sample 0.016556 0.016599
## Estimated Ne^ 171.9 226.6
## CI low Parametric 168.8 223.5
## CI high Parametric 175.2 229.9
## CI low JackKnife 117.3 153.3
## CI high JackKnife 301.7 411
##
## Starting gl2genepop
## Processing genlight object with SNP data
## The genepop file is saved as: /tmp/RtmpSfUVyo/dummy.gen/
## Completed: gl2genepop
## $`2014_MBB`
## Statistic Frequency 1 Frequency 2
## Lowest Allele Frequency Used 0.050 0+
## Harmonic Mean Sample Size 22.9 22.9
## Independent Comparisons 1950574 4454940
## OverAll r^2 0.052769 0.051739
## Expected r^2 Sample 0.049838 0.04988
## Estimated Ne^ 103.3 163.9
## CI low Parametric 99.7 158.1
## CI high Parametric 107.2 170.2
## CI low JackKnife 48.8 75.9
## CI high JackKnife 10445.4 Inf
Nb_LD_data_all <- read_csv("./Results/NeEstimator/NeResultsSummary.csv")
Nb_LD_data_all$p_crit <- factor(Nb_LD_data_all$p_crit)
ggplot(Nb_LD_data_all, aes(x = Year, y = Nb_LD_est, colour = p_crit, group=p_crit)) + geom_point(position =dodge) + geom_errorbar(aes(ymin = Nb_LD_lower, ymax = Nb_LD_upper), width = width, position =dodge) + coord_cartesian(ylim = c(0,1000)) + theme_bw()
Nb_LD_data <- Nb_LD_data_all %>% filter(p_crit == p_crit_chosen) # need to pick one; p_crit = 0.05 matches the original settings
write_csv(Nb_LD_data, "./Results/NeEstimator/NeResultsSummary_final.csv")
#read in from file so always works even if not a full rerun
Nb_LD_data <- read_csv( "./Results/NeEstimator/NeResultsSummary_final.csv")
Nb_SA_data <- read_csv( "./Results/COLONY/ResultsSummary_final.csv")
#join and save
data <- left_join(Nb_LD_data, Nb_SA_data, by="Year")
write_csv(data, "./Results/ResultsSummary.csv")
data <- read_csv( "./Results/ResultsSummary.csv")
Ne_to_r <- function(Ne, S){
r <- (-69*S^2 + sqrt(10000 * (S^4) * (Ne^2) + 4761 * (S^4) - 248400*(S^3)*Ne) + 1800*S*(Ne^2))/( 1800*(S^2)*(Ne^2))
return(r)
}
Ne_to_r_star <- function(Ne, S){
r <- Ne_to_r(Ne, S)
r <- r - 1/S
return(r)
}
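Two properties of Ne_to_r can be checked numerically (a sketch, assuming the LD-method expectation that \(\hat{r}^2\) declines towards the sample-size component \(1/S\) as the number of breeders grows): it should decrease in Ne, and approach \(1/S\) in the limit.

```r
# Ne_to_r as defined above, repeated here for self-containment
Ne_to_r <- function(Ne, S){
  (-69*S^2 + sqrt(10000 * (S^4) * (Ne^2) + 4761 * (S^4) - 248400*(S^3)*Ne) +
     1800*S*(Ne^2)) / (1800*(S^2)*(Ne^2))
}
S <- 50
stopifnot(Ne_to_r(100, S) > Ne_to_r(1000, S))   # r^2 decreases as Ne grows
stopifnot(abs(Ne_to_r(1e8, S) - 1/S) < 1e-6)    # tends to 1/S as Ne -> Inf
```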
data$Var_SA_lower <- 0.25 * data$Nb_SA_est^4 * ((1 / data$Nb_SA_lower) - (1 / data$Nb_SA_est))^2
data$Var_SA_upper <- 0.25 * data$Nb_SA_est^4 * ((1 / data$Nb_SA_upper) - (1 / data$Nb_SA_est))^2
data$Var_SA_mean <- ( data$Var_SA_lower + data$Var_SA_upper ) / 2
data$Var_SA_weight <- 1 / data$Var_SA_mean
data$r_LD_lower <- Ne_to_r(data$Nb_LD_lower, data$S)
data$r_LD_upper <- Ne_to_r(data$Nb_LD_upper, data$S)
data$r_LD_est <- Ne_to_r(data$Nb_LD_est, data$S)
data$rstar_LD_lower <- Ne_to_r_star(data$Nb_LD_lower, data$S)
data$rstar_LD_upper <- Ne_to_r_star(data$Nb_LD_upper, data$S)
data$rstar_LD_est <- Ne_to_r_star(data$Nb_LD_est, data$S)
data$Var_LD_lower <- (1/36) * data$rstar_LD_est^-4 * (data$rstar_LD_est - data$rstar_LD_lower)^2
data$Var_LD_upper <- (1/36) * data$rstar_LD_est^-4 * (data$rstar_LD_est - data$rstar_LD_upper)^2
data$Var_LD_mean <- ( data$Var_LD_lower + data$Var_LD_upper ) / 2
data$Var_LD_weight <- 1 / data$Var_LD_mean
data$weight_sum <- data$Var_LD_weight + data$Var_SA_weight
data$Var_LD_weightadj <- data$Var_LD_weight / data$weight_sum
data$Var_SA_weightadj <- data$Var_SA_weight / data$weight_sum
data$Nb_COM_est <- 1 / ((data$Var_SA_weightadj /data$Nb_SA_est) + (data$Var_LD_weightadj / data$Nb_LD_est))
data$var_COM <- 1 / ((1/data$Var_LD_mean) + (1/data$Var_SA_mean))
# Note: these bounds are est +/- SD, not confidence intervals
data$Nb_COM_lower <- data$Nb_COM_est - sqrt(data$var_COM)
data$Nb_COM_upper <- data$Nb_COM_est + sqrt(data$var_COM)
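The combination above is an inverse-variance weighted harmonic mean: each estimate contributes on the 1/Nb scale, weighted by the inverse of its (mean) variance. A toy sketch with hypothetical numbers (combine_Nb is an illustrative helper, not part of the analysis):

```r
# Toy illustration of the Nb_COM_est calculation above
combine_Nb <- function(Nb_SA, var_SA, Nb_LD, var_LD) {
  w_SA <- 1 / var_SA; w_LD <- 1 / var_LD
  w_sum <- w_SA + w_LD
  1 / ((w_SA / w_sum) / Nb_SA + (w_LD / w_sum) / Nb_LD)
}
# equal variances -> plain harmonic mean of the two estimates
stopifnot(abs(combine_Nb(100, 1, 200, 1) - 2 / (1/100 + 1/200)) < 1e-9)
# the better-estimated (lower-variance) value dominates
stopifnot(abs(combine_Nb(100, 1e-6, 200, 1e6) - 100) < 0.01)
```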
#nice table
data %>% dplyr::select(Year, starts_with("Nb_")) -> data_est_only
if(!include_2014){
  data_est_only <- data_est_only %>% filter(Year != 2014)
}
data_est_only %>% gt() %>% fmt_number(columns = c(5:10), decimals = 1)
| Year | Nb_LD_est | Nb_LD_lower | Nb_LD_upper | Nb_SA_est | Nb_SA_lower | Nb_SA_upper | Nb_COM_est | Nb_COM_lower | Nb_COM_upper |
|---|---|---|---|---|---|---|---|---|---|
| 2010 | 61.2 | 31.9 | 257.4 | 96.0 | 58.0 | 194.0 | 74.7 | 54.9 | 94.5 |
| 2011 | 205.7 | 109.8 | 1023.9 | 234.0 | 144.0 | 492.0 | 222.5 | 169.2 | 275.9 |
| 2012 | 214.8 | 134.0 | 489.2 | 196.0 | 133.0 | 319.0 | 199.1 | 160.8 | 237.5 |
| 2013 | 171.9 | 117.3 | 301.7 | 152.0 | 109.0 | 221.0 | 154.2 | 128.9 | 179.5 |
| 2014 | 103.3 | 48.8 | 10445.4 | 127.0 | 71.0 | 358.0 | 114.3 | 81.4 | 147.2 |
Outputs of NeEstimator and COLONY are saved in Report/NbResults.xlsx, where jackknife CIs are reported for p_crit = 0.05.
Plot of Nb estimates with 95% confidence intervals
The combined estimates are also plotted here. The error bars in this case represent ± SD, not a confidence interval.
Plot of combined Nb estimates ± SD
I take stability to mean that there are no significant differences between the Nb estimates for any years; this corresponds to the null hypothesis that Nb_2010 = Nb_2011 = Nb_2012 = Nb_2013.
The situation is difficult to evaluate graphically. The point estimates for 2010 and 2014 sit outside the common region (dark grey), but the 2011-2013 estimates (and the entirety of their CIs) lie within the CIs for the other years. No firm conclusions can be drawn from examining the individual CIs alone.
LD estimates with 95% CIs; values which fall within all CIs are marked by a grey rectangle
To examine this formally, we need to construct pairwise tests. This must be done with \(r^{*}\), since the LDNe confidence intervals for Nb are not normal, whereas the confidence intervals for \(r^{*}\) in LDNe are approximately normal. This method of comparison is statistically consistent with the manner in which the CIs are constructed (Jones, Ovenden and Wang (2016)).
A slightly more accurate result could be obtained by pulling the results directly from the LDNe code and making minor adjustments to account for the fact that the normality is only approximate; however, I do not believe this would make a difference in this case.
The following table summarises the pairwise comparisons of the Nb estimates in terms of \(r^{*}\). The first two columns identify which years are being compared; the third and fourth columns give the relevant \(r^{*}\) values.
Their difference and the estimated standard deviation of that difference follow. The combined standard deviation is found by taking the square root of the sum of the two variances. This assumes the two estimates are independent; while they are likely correlated, we have no way of accurately estimating their covariance, so I believe it is best to assume independence.
The z-score and p-value are based on the ordinary z-test for the difference of two normally distributed variables. The p-values are adjusted using the Holm method, on the basis that we are making multiple pairwise comparisons and wish to keep the family-wise error rate below 0.05. None of the unadjusted p-values were significant at the 0.05 level anyway.
data$rstar_LD_lower <- Ne_to_r_star(data$Nb_LD_lower, data$S)
data$rstar_LD_upper <- Ne_to_r_star(data$Nb_LD_upper, data$S)
data$rstar_LD_est <- Ne_to_r_star(data$Nb_LD_est, data$S)
data$rstar_sd_mean <- (data$rstar_LD_lower - data$rstar_LD_upper) / (2*qnorm(0.9750))
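The back-calculation above recovers a standard deviation from a 95% CI by dividing the interval width by \(2 \times z_{0.975}\); under the normality assumption this is exact, as a quick check with hypothetical values shows:

```r
# For a normal 95% CI, width / (2 * qnorm(0.975)) recovers the SD (hypothetical values)
sd_true <- 0.0003
est <- 0.0009
ci <- est + c(-1, 1) * qnorm(0.975) * sd_true       # construct a normal 95% CI
sd_back <- (ci[2] - ci[1]) / (2 * qnorm(0.975))     # invert it
stopifnot(abs(sd_back - sd_true) < 1e-12)
```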
# Empty frame with the output columns; data.frame() converts the quoted
# names to syntactic ones (combined.sd, z.score, p.value)
results_frame <- data.frame("Year1" = NA, "Year2" = NA, "r1" = NA, "r2" = NA, "diff" = NA, "combined sd" = NA, "z-score" = NA, "p-value" = NA) %>% na.omit()
# All pairwise comparisons of the cohorts
for(i in 1:(nrow(data)-1)){
  for(j in (i+1):nrow(data)){
    temp <- c(
      data$Year[i],
      data$Year[j],
      data$rstar_LD_est[i],
      data$rstar_LD_est[j],
      data$rstar_LD_est[i] - data$rstar_LD_est[j],
      # combined sd, assuming independence
      sqrt(data$rstar_sd_mean[i]^2 + data$rstar_sd_mean[j]^2),
      (data$rstar_LD_est[i] - data$rstar_LD_est[j]) / sqrt(data$rstar_sd_mean[i]^2 + data$rstar_sd_mean[j]^2),
      # one-sided p-value from the z-score
      1 - pnorm(abs(data$rstar_LD_est[i] - data$rstar_LD_est[j]) / sqrt(data$rstar_sd_mean[i]^2 + data$rstar_sd_mean[j]^2))
    )
    results_frame[nrow(results_frame) + 1, ] <- temp
  }
}
results_frame$p.value.adj <- p.adjust(results_frame$p.value, "holm")
results_frame %>% gt() %>% opt_stylize() %>% fmt_number(columns = c(7:9), decimals = 3) %>% fmt_number(columns = c(3:6), n_sigfig = 3)
| Year1 | Year2 | r1 | r2 | diff | combined.sd | z.score | p.value | p.value.adj |
|---|---|---|---|---|---|---|---|---|
| 2010 | 2011 | 0.000891 | 0.000269 | 0.000622 | 0.000391 | 1.592 | 0.056 | 0.501 |
| 2010 | 2012 | 0.000891 | 0.000258 | 0.000634 | 0.000382 | 1.661 | 0.048 | 0.484 |
| 2010 | 2013 | 0.000891 | 0.000322 | 0.000570 | 0.000381 | 1.495 | 0.067 | 0.539 |
| 2010 | 2014 | 0.000891 | 0.000531 | 0.000360 | 0.000468 | 0.768 | 0.221 | 1.000 |
| 2011 | 2012 | 0.000269 | 0.000258 | 0.0000113 | 0.000137 | 0.082 | 0.467 | 1.000 |
| 2011 | 2013 | 0.000269 | 0.000322 | −0.0000527 | 0.000135 | −0.389 | 0.349 | 1.000 |
| 2011 | 2014 | 0.000269 | 0.000531 | −0.000263 | 0.000304 | −0.864 | 0.194 | 1.000 |
| 2012 | 2013 | 0.000258 | 0.000322 | −0.0000640 | 0.000106 | −0.607 | 0.272 | 1.000 |
| 2012 | 2014 | 0.000258 | 0.000531 | −0.000274 | 0.000292 | −0.938 | 0.174 | 1.000 |
| 2013 | 2014 | 0.000322 | 0.000531 | −0.000210 | 0.000291 | −0.721 | 0.235 | 1.000 |
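As a sanity check, the Holm adjustment in the last column can be reproduced from the unadjusted p-values. Re-typing the rounded values from the table above, the adjusted values agree with the table up to rounding error:

```r
# Unadjusted (one-sided) p-values from the LD table, in row order
p_raw <- c(0.056, 0.048, 0.067, 0.221, 0.467, 0.349, 0.194, 0.272, 0.174, 0.235)
# Holm: multiply the k-th smallest p-value by (n - k + 1), then enforce
# monotonicity and cap at 1
p.adjust(p_raw, method = "holm")
```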
With the updated results, the intervals for the SA method are no longer mutually disjoint, and the situation is now much the same as for LD.
The analysis proceeds in the same manner as for \(r^{*}\), except that the comparison is made on the \(1/(2N_b)\) scale.
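The motivation for this transform is that CIs for \(N_b\) itself are strongly asymmetric, while on the \(1/(2N_b)\) scale they become much closer to symmetric, so a normal approximation is more defensible. A sketch with hypothetical numbers (not taken from the data):

```r
# Hypothetical Nb estimate with an asymmetric 95% CI
Nb_lower <- 100; Nb_est <- 150; Nb_upper <- 300
# On the Nb scale the CI arms are unequal: 50 below vs 150 above.
# On the 1/(2*Nb) scale the arms here are equal (1/600 each):
c(1/(2*Nb_upper), 1/(2*Nb_est), 1/(2*Nb_lower))
# Note the bounds swap: a large Nb maps to a small 1/(2*Nb)
```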
# Point estimates on the 1/(2*Nb) scale
data$Nb_SA_est_inv <- 1 / (2*data$Nb_SA_est)
# One-sided variance estimates from each half of the 95% CI; dividing the
# 1/Nb half-width by 2*qnorm(0.975) equals dividing the 1/(2*Nb) half-width
# by qnorm(0.975)
data$vstar_lower <- ((1/data$Nb_SA_lower - 1/data$Nb_SA_est) / (2*qnorm(0.9750)))^2
data$vstar_upper <- ((1/data$Nb_SA_est - 1/data$Nb_SA_upper) / (2*qnorm(0.9750)))^2
# Average the two one-sided variance estimates
data$vstar_mean <- (data$vstar_upper + data$vstar_lower)/2
# Same pairwise machinery as above, now on the 1/(2*Nb) scale
results_frame <- data.frame("Year1" = NA, "Year2" = NA, "2Nb^-1 1" = NA, "2Nb^-1 2" = NA, "diff" = NA, "combined sd" = NA, "z-score" = NA, "p-value" = NA) %>% na.omit()
for(i in 1:(nrow(data)-1)){
  for(j in (i+1):nrow(data)){
    temp <- c(
      data$Year[i],
      data$Year[j],
      data$Nb_SA_est_inv[i],
      data$Nb_SA_est_inv[j],
      data$Nb_SA_est_inv[i] - data$Nb_SA_est_inv[j],
      # combined sd, assuming independence
      sqrt(data$vstar_mean[i] + data$vstar_mean[j]),
      (data$Nb_SA_est_inv[i] - data$Nb_SA_est_inv[j]) / sqrt(data$vstar_mean[i] + data$vstar_mean[j]),
      # one-sided p-value from the z-score
      1 - pnorm(abs(data$Nb_SA_est_inv[i] - data$Nb_SA_est_inv[j]) / sqrt(data$vstar_mean[i] + data$vstar_mean[j]))
    )
    results_frame[nrow(results_frame) + 1, ] <- temp
  }
}
results_frame$p.value.adj <- p.adjust(results_frame$p.value, "holm")
names(results_frame)[3:4] <- c('(2Nb)^-1 1', '(2Nb)^-1 2')
results_frame %>% gt() %>% opt_stylize() %>% fmt_number(columns = c(6:9), decimals = 3) %>% fmt_number(columns = c(3:5), n_sigfig = 3)
| Year1 | Year2 | (2Nb)^-1 1 | (2Nb)^-1 2 | diff | combined.sd | z.score | p.value | p.value.adj |
|---|---|---|---|---|---|---|---|---|
| 2010 | 2011 | 0.00521 | 0.00214 | 0.00307 | 0.002 | 1.832 | 0.034 | 0.335 |
| 2010 | 2012 | 0.00521 | 0.00255 | 0.00266 | 0.002 | 1.608 | 0.054 | 0.486 |
| 2010 | 2013 | 0.00521 | 0.00329 | 0.00192 | 0.002 | 1.152 | 0.125 | 0.872 |
| 2010 | 2014 | 0.00521 | 0.00394 | 0.00127 | 0.002 | 0.599 | 0.275 | 0.920 |
| 2011 | 2012 | 0.00214 | 0.00255 | −0.000414 | 0.001 | −0.491 | 0.312 | 0.920 |
| 2011 | 2013 | 0.00214 | 0.00329 | −0.00115 | 0.001 | −1.329 | 0.092 | 0.735 |
| 2011 | 2014 | 0.00214 | 0.00394 | −0.00180 | 0.002 | −1.141 | 0.127 | 0.872 |
| 2012 | 2013 | 0.00255 | 0.00329 | −0.000738 | 0.001 | −0.901 | 0.184 | 0.920 |
| 2012 | 2014 | 0.00255 | 0.00394 | −0.00139 | 0.002 | −0.893 | 0.186 | 0.920 |
| 2013 | 2014 | 0.00329 | 0.00394 | −0.000648 | 0.002 | −0.414 | 0.340 | 0.920 |
After adjustment for multiple comparisons, none of the pairwise differences are significant; the smallest adjusted p-value is 0.335 (2010 vs 2011), even though its unadjusted p-value of 0.034 falls below 0.05. As with the LD estimates, 2010 stands out as the most extreme point estimate, but none of its comparisons with the later cohorts survive correction.
I had a number of reservations about this analysis, but I have included it because I believe it is informative.
I have also tested for (linear) trends. With only five points, the power to detect a trend is very low, but I think the exercise is worthwhile.
# OLS trend test on the LD-based Nb point estimates
ols <- lm(Nb_LD_est ~ Year, data=data_est_only)
lmfit <- broom::augment(ols, se_fit = TRUE, interval='confidence')
Using the method of Luikart et al 2020 (i.e. ordinary least squares regression on the Nb_LD estimates) there is no significant trend (p-value = 0.848). We can also see this graphically as the CI for the trendline (grey shaded area) includes the possibility of a flat line (i.e. slope=0, no trend).
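The quoted p-value is the one for the `Year` slope in the OLS fit; a self-contained sketch of where it comes from, using made-up `Nb_LD_est` values:

```r
# Hypothetical data: five yearly Nb estimates (made up for illustration)
data_est_only <- data.frame(Year = 2010:2014,
                            Nb_LD_est = c(380, 520, 545, 470, 430))
ols <- lm(Nb_LD_est ~ Year, data = data_est_only)
summary(ols)$coefficients["Year", "Pr(>|t|)"]  # p-value for H0: slope = 0
confint(ols)["Year", ]                         # 95% CI for the slope
```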
gls3 <- nlme::gls(Nb_LD_est ~ Year, correlation=corCAR1(0.5, form = ~ 1, fixed=TRUE), method='REML', verbose=TRUE, data=data)
# Design matrix for our observations
xmat <- model.matrix(~ Year, data=data)
# Regression coefficients
betahat <- coef(gls3)
glsfit <- broom.mixed::augment(gls3)
Sigmahat <- vcov(gls3)
#https://fw8051statistics4ecologists.netlify.app/gls.html
# var/cov(beta0 + beta1*X)
varcovhat<-xmat%*%Sigmahat%*%t(xmat)
SEline <- sqrt(diag(varcovhat))
glsfit$.upper <- glsfit$.fitted + qnorm(0.975) * SEline
glsfit$.lower <- glsfit$.fitted - qnorm(0.975) * SEline
Nor is there a trend when using a GLS regression that incorporates an estimate of autocorrelation (p-value = 0.808). This model only accounts for the correlation between estimates and is not the only possible approach: it does not address the other issues, such as the asymmetric CIs, nor does it incorporate other known information such as the variance of each estimate. These tests are nevertheless probably adequate here, since so few points are very unlikely to show evidence of a trend regardless.
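The GLS p-value is read from the coefficient t-table of the `nlme::gls` fit. A self-contained sketch with hypothetical data (the CAR(1) correlation structure mirrors the one used in the chunk above):

```r
library(nlme)
# Hypothetical yearly Nb estimates, for illustration only
d <- data.frame(Year = 2010:2014, Nb = c(380, 520, 545, 470, 430))
g <- gls(Nb ~ Year, data = d,
         correlation = corCAR1(0.5, form = ~ Year, fixed = TRUE))
summary(g)$tTable["Year", "p-value"]  # slope p-value under the CAR(1) model
```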
The same analysis can be repeated with the SA estimates.
ols <- lm(Nb_SA_est ~ Year, data=data_est_only)
lmfit <- broom::augment(ols, se_fit = TRUE, interval='confidence')
Using the method of Luikart et al 2020 (i.e. ordinary least squares regression on the Nb_SA estimates) there is no significant trend (p-value = 0.927).
gls3 <- nlme::gls(Nb_SA_est ~ Year, correlation=corCAR1(0.5, form = ~ 1, fixed=TRUE), method='REML', verbose=TRUE, data=data)
# Design matrix for our observations
xmat <- model.matrix(~ Year, data=data)
# Regression coefficients
betahat <- coef(gls3)
glsfit <- broom.mixed::augment(gls3)
Sigmahat <- vcov(gls3)
#https://fw8051statistics4ecologists.netlify.app/gls.html
# var/cov(beta0 + beta1*X)
varcovhat<-xmat%*%Sigmahat%*%t(xmat)
SEline <- sqrt(diag(varcovhat))
glsfit$.upper <- glsfit$.fitted + qnorm(0.975) * SEline
glsfit$.lower <- glsfit$.fitted - qnorm(0.975) * SEline
Nor is there a trend when using a GLS regression that incorporates an estimate of autocorrelation (p-value = 0.896).
In short, Nb appears stable over the study period.
Plot of Nb estimates with 95% confidence intervals
R.version
## _
## platform x86_64-pc-linux-gnu
## arch x86_64
## os linux-gnu
## system x86_64, linux-gnu
## status
## major 4
## minor 3.3
## year 2024
## month 02
## day 29
## svn rev 86002
## language R
## version.string R version 4.3.3 (2024-02-29)
## nickname Angel Food Cake